Statistical learning: classification problems

MACS 30500
University of Chicago

February 8, 2017

Titanic

Sinking of the Titanic

Titanic (1997)

Get data
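
The structure shown below comes from the Titanic training data (distributed via Kaggle). One way to load it with the tidyverse, where the file name is an assumption and should be replaced with the actual path:

```r
library(tidyverse)

# Hypothetical file name -- substitute the real path to the Kaggle training CSV
titanic <- read_csv("titanic_train.csv")
str(titanic)
```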

## Classes 'tbl_df', 'tbl' and 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Linear regression

Logistic regression

\[P(\text{survival} = \text{Yes} | \text{age})\]

  • Predicted probability of surviving
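
Rather than modeling this probability directly with a line (which can produce values outside \([0, 1]\)), logistic regression passes a linear function of age through the logistic function:

\[P(\text{survival} = \text{Yes} | \text{age}) = \frac{e^{\beta_0 + \beta_1 \text{age}}}{1 + e^{\beta_0 + \beta_1 \text{age}}}\]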

Logistic regression

survive_age <- glm(Survived ~ Age, data = titanic, family = binomial)
summary(survive_age)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1488  -1.0361  -0.9544   1.3159   1.5908  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.05672    0.17358  -0.327   0.7438  
## Age         -0.01096    0.00533  -2.057   0.0397 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 960.23  on 712  degrees of freedom
##   (177 observations deleted due to missingness)
## AIC: 964.23
## 
## Number of Fisher Scoring iterations: 4
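
The coefficients above are on the log-odds scale. To recover a probability for a specific age, apply the inverse logit; a sketch using base R's plogis():

```r
# Predicted survival probability for a 30-year-old,
# built from the coefficients of the model fit above
plogis(coef(survive_age)[["(Intercept)"]] + coef(survive_age)[["Age"]] * 30)
# roughly 0.40 -- each additional year of age lowers the predicted odds slightly
```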

Multiple predictors
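
The output below adds Sex as a second predictor. The fitting call (the object name is an assumption) would look like:

```r
survive_age_woman <- glm(Survived ~ Age + Sex,
                         data = titanic, family = binomial)
summary(survive_age_woman)
```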

## 
## Call:
## glm(formula = Survived ~ Age + Sex, family = binomial, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7405  -0.6885  -0.6558   0.7533   1.8989  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.277273   0.230169   5.549 2.87e-08 ***
## Age         -0.005426   0.006310  -0.860     0.39    
## Sexmale     -2.465920   0.185384 -13.302  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 749.96  on 711  degrees of freedom
##   (177 observations deleted due to missingness)
## AIC: 755.96
## 
## Number of Fisher Scoring iterations: 4

Quadratic terms
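
A quadratic term lets the effect of age bend rather than stay strictly monotone. In R it can be added inside the formula with I(); a sketch, with the object name an assumption:

```r
# I() protects ^ so it is interpreted arithmetically, not as a formula operator
survive_age_square <- glm(Survived ~ Age + I(Age^2),
                          data = titanic, family = binomial)
summary(survive_age_square)
```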

Interactive terms

\[f = \beta_{0} + \beta_{1}\text{age} + \beta_{2}\text{gender}\]

\[f = \beta_{0} + \beta_{1}\text{age} + \beta_{2}\text{gender} + \beta_{3}(\text{age} \times \text{gender})\]
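
This interaction model, used later as survive_age_woman_x, can be fit with the * formula operator, which expands to both main effects plus their interaction:

```r
survive_age_woman_x <- glm(Survived ~ Age * Sex,
                           data = titanic, family = binomial)
summary(survive_age_woman_x)
```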

Accuracy of predictions
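
The code below relies on a logit2prob() helper to convert predicted log-odds into probabilities; a minimal definition consistent with its use here:

```r
# Convert log-odds (the scale of glm() predictions) to probabilities
logit2prob <- function(x) {
  exp(x) / (1 + exp(x))
}
```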

age_accuracy <- titanic %>%
  add_predictions(survive_age) %>%
  mutate(pred = logit2prob(pred),
         pred = as.numeric(pred > .5))

mean(age_accuracy$Survived == age_accuracy$pred, na.rm = TRUE)
## [1] 0.5938375

Accuracy of predictions

x_accuracy <- titanic %>%
  add_predictions(survive_age_woman_x) %>%
  mutate(pred = logit2prob(pred),
         pred = as.numeric(pred > .5))

mean(x_accuracy$Survived == x_accuracy$pred, na.rm = TRUE)
## [1] 0.780112

Training/test sets

  • Training set: the observations used to fit the model
  • Test set: held-out observations the model never sees during fitting
  • Why split up? Accuracy measured on the training data is overly optimistic; the test set gives an honest estimate of out-of-sample performance

Test set for Titanic

titanic_split <- resample_partition(titanic, c(test = 0.3, train = 0.7))
map(titanic_split, dim)
## $test
## [1] 267  12
## 
## $train
## [1] 624  12
train_model <- glm(Survived ~ Age * Sex, data = titanic_split$train,
                   family = binomial)
summary(train_model)
## 
## Call:
## glm(formula = Survived ~ Age * Sex, family = binomial, data = titanic_split$train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9937  -0.7035  -0.5694   0.7141   2.2576  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.79748    0.35704   2.234  0.02551 * 
## Age          0.01829    0.01259   1.453  0.14632   
## Sexmale     -1.58501    0.48589  -3.262  0.00111 **
## Age:Sexmale -0.03928    0.01661  -2.365  0.01801 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 677.69  on 495  degrees of freedom
## Residual deviance: 500.36  on 492  degrees of freedom
##   (128 observations deleted due to missingness)
## AIC: 508.36
## 
## Number of Fisher Scoring iterations: 4
x_test_accuracy <- titanic_split$test %>%
  tbl_df() %>%
  add_predictions(train_model) %>%
  mutate(pred = logit2prob(pred),
         pred = as.numeric(pred > .5))

mean(x_test_accuracy$Survived == x_test_accuracy$pred, na.rm = TRUE)
## [1] 0.7522936

Exercise: depression and voting

Complete this exercise

Should I Have a Cookie?

Interpreting a decision tree

Benefits/drawbacks to decision trees

Benefits:

  • Easy to explain
  • Easy to interpret/visualize
  • Handle qualitative predictors naturally, without dummy coding

Drawbacks:

  • Lower accuracy rates than many other methods
  • Non-robust: small changes in the data can produce a very different tree
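
As a concrete illustration, a classification tree for the Titanic data could be grown with the rpart package (one option among several; the course materials may use a different tree implementation):

```r
library(rpart)

# method = "class" requests a classification (rather than regression) tree
titanic_tree <- rpart(Survived ~ Age + Sex, data = titanic, method = "class")
titanic_tree
```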

Random forests

  • Bootstrapping: resample the training data with replacement many times
  • Reduces variance: averaging across many trees stabilizes the predictions
  • Bagging: fit a tree to each bootstrap sample and average their predictions
  • Random forest: bagging, but each split considers only a random subset of predictors
    • Reliability: this decorrelates the trees, so the averaged forest is more reliable than bagging alone
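
A sketch with the randomForest package (object names and tuning values are assumptions, not the course's code):

```r
library(randomForest)

# Classification requires a factor response; na.action = na.omit drops rows
# with missing Age values before fitting
titanic_rf <- randomForest(as.factor(Survived) ~ Age + Sex + Pclass + Fare,
                           data = titanic, ntree = 500, na.action = na.omit)
titanic_rf
```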

The caret library

Estimating statistical models using caret
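
With caret, many model types share a single interface through train(). A minimal sketch reproducing the logistic regression above (the arguments shown are a reasonable call, not the course's exact code):

```r
library(caret)

# train() needs complete cases and a factor outcome for classification;
# method = "glm" with a binomial family reproduces logistic regression
titanic_cc <- titanic[!is.na(titanic$Age), ]
titanic_caret <- train(as.factor(Survived) ~ Age + Sex, data = titanic_cc,
                       method = "glm", family = "binomial")
titanic_caret
```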

Exercise: depression and voting

Complete this exercise